DISCUSSION OF “FREQUENTIST COVERAGE OF ADAPTIVE
NONPARAMETRIC BAYESIAN CREDIBLE SETS” 

By Subhashis Ghosal 
North Carolina State University 

First I would like to congratulate the authors Botond Szabó, Aad van der
Vaart and Harry van Zanten for a fine piece of work on the extremely important
topic of frequentist coverage of adaptive nonparametric credible sets.
Credible sets are used by Bayesians to quantify uncertainty of estimation,
which is typically viewed as more informative than point estimation. Such
sets are often easily constructed, for instance, by sampling from the posterior,
while confidence sets in the frequentist setting may need evaluating
limiting distributions, or resampling, which needs additional justification.
Bayesian uncertainty quantification in parametric problems from the frequentist
view is justified through the Bernstein-von Mises theorem. In recent
years, such results have also been obtained for the parametric part in certain
semiparametric models, guaranteeing coverage of Bayesian credible sets for
it. However, as mentioned by the authors, inadequate coverage of nonparametric
credible sets has been observed [Cox (1993), Freedman (1999)] in the
white noise model, arguably the simplest nonparametric model. A clearer
picture emerged after the work of Knapik, van der Vaart and van Zanten
(2011) that undersmoothing priors can resolve the issue of coverage; see also
Leahu (2011) and Castillo and Nickl (2013).

In the present paper, the authors address the issue of coverage of credible
sets in a white noise model under the inverse problem setting, when
the underlying smoothness (i.e., regularity) of the true parameter is not
known, so a procedure must adapt to the smoothness. The authors follow
an empirical Bayes approach where a key regularity parameter in the prior is
estimated from its marginal likelihood function. As the authors mentioned,
undersmoothing leads to inferior point estimation and is also difficult to
implement when the smoothness of the parameter is not known. We shall
see that the issue of coverage can also be addressed by two alternative
approaches.

Before entering a discussion on the contents of the paper, let us take another
look at the coverage problem for Bayesian credible sets in an abstract
setting. Suppose that we have a family of experiments based on observations
$Y^{(n)}$ and indexed by a parameter $\theta \in \Theta$, some appropriate metric space. Let
$\epsilon_n$ be the minimax convergence rate for estimating $\theta$. Let $\gamma_n \in [0,1]$ be a
sequence which can be fixed or may tend to 0. For some $m_n \to \infty$, typically
a slowly varying sequence, the goal is to find a subset $C(Y^{(n)}) \subset \Theta$ such that,
uniformly on $\theta_0 \in B$,

(i) $\Pi(\theta \in C(Y^{(n)}) \mid Y^{(n)}) \ge 1 - \gamma_n$,

(ii) $P_{\theta_0}^{(n)}(\theta_0 \in C(Y^{(n)})) \to 1$,

(iii) $\operatorname{diam}(C(Y^{(n)})) = O_{P_{\theta_0}^{(n)}}(m_n \epsilon_n)$,

where $B$ varies over a class of compact balls in $\Theta$.

In the formulation, credibility may increase with the sample size. We
find it natural that when the information content is increasing, a researcher
should quantify uncertainty with more and more confidence, instead of staying
at a fixed level, just as one seeks more precise point estimators or
tests. If $\gamma_n \to 0$, it can be seen that the problems of mismatch of credibility
and coverage pointed out in Cox (1993) and Freedman (1999) go away.
Thus, although the uncertainty quantification of a Bayesian and a frequentist
may not match at finite levels, they do match at the infinitesimal level.
For finer matching, one may also like to impose some requirement on how
fast $P_{\theta_0}^{(n)}(\theta_0 \in C(Y^{(n)}))$ should approach 1, but we shall forgo the issue in
this discussion. Another approach is to obtain a $(1 - \gamma_n)$-credible ball around
the posterior mean, typically with $\gamma_n$ fixed, and inflate the region by a factor
$m_n$, to be called the inflation factor, to ensure adequate frequentist coverage.
The size of the original credible region is typically of the order of the
minimax convergence rate so that the third condition will be met. The
factor $m_n$ can be considered as a reasonable price for the increased level
of coverage. Typically, the resulting extra cost $m_n$ is low; for instance, in
an asymptotic normality setting, while adopting $(1 - \gamma_n)$-credible sets with
$\gamma_n \to 0$, the additional cost is $m_n = O(\sqrt{\log(1/\gamma_n)})$. In the setting we shall
discuss, the inflation factor may be taken as a sufficiently large constant.
The supremum over compact sets in the formulation imposes honesty of the
coverage.
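
As a rough numerical illustration of how mild the cost of increasing credibility is in the asymptotically normal case, the following Python sketch compares the two-sided standard normal quantile at level $1 - \gamma_n$ with the envelope $\sqrt{2\log(1/\gamma_n)}$. The function names, the reference level 0.05 and the use of a scalar standard normal as a stand-in for the limiting posterior are illustrative assumptions, not part of the paper under discussion.

```python
import math
from statistics import NormalDist


def two_sided_quantile(gamma: float) -> float:
    """Radius z such that P(|N(0,1)| <= z) = 1 - gamma."""
    return NormalDist().inv_cdf(1.0 - gamma / 2.0)


def inflation_factor(gamma_n: float, gamma_fixed: float = 0.05) -> float:
    """Hypothetical helper: ratio of the (1 - gamma_n) radius to a fixed-level radius."""
    return two_sided_quantile(gamma_n) / two_sided_quantile(gamma_fixed)


def sqrt_log_envelope(gamma: float) -> float:
    """Envelope sqrt(2 log(1/gamma)) implied by the Gaussian tail bound
    P(|N(0,1)| > x) <= exp(-x^2 / 2)."""
    return math.sqrt(2.0 * math.log(1.0 / gamma))


if __name__ == "__main__":
    for gamma_n in (0.05, 1e-2, 1e-4, 1e-6):
        print(gamma_n, inflation_factor(gamma_n), sqrt_log_envelope(gamma_n))
```

Even when $\gamma_n$ drops from 0.05 to $10^{-6}$, the radius in this toy setting grows only by a small constant factor, in line with the $\sqrt{\log(1/\gamma_n)}$ order.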

As mentioned by the authors, fully adaptive honest nonparametric confidence
regions are not possible by any means, so in the adaptive context $\Theta$
will be replaced by an appropriate subset of the parameter space, such as
the set of self-similar sequences or polished tail sequences in the context of
the paper. The concept of polished tail is pretty elegant as it blends nicely
in the adaptive setting without any direct reference to the smoothness of
the parameter.

The main result proved in the paper, namely, honest coverage of adaptive
posterior credible regions for $\theta = (\theta_1, \theta_2, \ldots)$ in the model
$Y_i = \kappa_i\theta_i + n^{-1/2}\varepsilon_i$, where $\theta \in \ell_2$ and
$\varepsilon_i \overset{\mathrm{iid}}{\sim} N(0,1)$, for all polished tail sequences is certainly exciting.
In terms of the equivalent (and perhaps more directly relevant) white
noise inverse problem model $dY(t) = Kf(t)\,dt + n^{-1/2}\,dW(t)$, this translates
into honest coverage of credible regions for $f$ through Parseval’s identity,
where the distance on $f$ is measured in terms of the $L_2$-distance. However,
$L_2$-regions for functions do not look like bands, and may be a little harder
to visualize. This aspect may be relevant for covering a true function that
has a bump like the one given by equation (4.1) in the paper under discussion,
since $L_2$-closeness does not even imply pointwise closeness, let alone
uniform closeness. Regions on function spaces given by $L_\infty$-neighborhoods
are easier to visualize and interpret. Moreover, such uniform closeness has
other implications. For instance, if derivatives of the functions in a region are
uniformly $\epsilon_n$-close to the derivative of the true function and the derivative of
the true function has a well separated mode, then the mode of a function in
that region is $O(\epsilon_n)$-close to the true mode. The observation can be used to
induce honest confidence regions for the mode from those for the derivative
function under the $L_\infty$-distance.

Study of coverage of $L_\infty$-regions with chosen credibility needs studying
posterior contraction rates under the $L_\infty$-norm, which is easier if conjugacy
is present, as in the white noise model or nonparametric regression using
a random series with normal coefficients. Below we shall argue that in the
white noise model credible regions for the $L_\infty$-norm can also be characterized
and computed relatively easily, and their coverage can be shown to
be adequate. Interestingly, we can use a fixed level of credibility (any value
higher than 1/2 works) and the inflation factor can be taken to be a constant.
We shall follow techniques similar to those used in Yoo and Ghosal (2014),
who considered the problem of multivariate nonparametric regression using
a random series of tensor product B-splines in the known smoothness
setting. In a sense the present treatment of the simpler white noise model
will be easier, but there are certain differences as well, particularly since
the number of basis elements used in constructing the prior is infinite in
the present case, unlike the case treated by Yoo and Ghosal (2014). For the
sake of simplicity of the discussion, we focus on the direct problem, that
is, $\kappa_i = 1$, and consider a Fourier basis $\phi_1(x) = 1$, $\phi_{2i}(x) = \sqrt{2}\cos(2\pi i x)$,
$\phi_{2i+1}(x) = \sqrt{2}\sin(2\pi i x)$, $i = 1, 2, \ldots$. Let the true function be denoted by
$f_0$ and the true sequence by $\theta_0 = (\theta_{01}, \theta_{02}, \ldots)$. Since we intend to study
the $L_\infty$-contraction rate and coverage of $L_\infty$-regions, we need to assume
that the true function $f_0$ belongs to a Hölder class or, stated in terms of
coefficients, $\sum_{i=1}^\infty i^{\alpha}|\theta_{0i}| < \infty$, which is stronger than the analogous Sobolev
condition $\sum_{i=1}^\infty i^{2\alpha}\theta_{0i}^2 < \infty$. The logarithmic factor we obtain in the rate is
not optimal; it is off by the factor $(\log n)^{1/(4\alpha+2)}$. Using a more refined
analysis or perhaps using a different basis like B-splines or wavelets, an optimal
logarithmic factor may be obtained as in Yoo and Ghosal (2014) or
Giné and Nickl (2011). Also, because we use a Fourier basis, we assume that
the true function is periodic, but this does not dampen the essential spirit
of the argument.

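Consider the white noise model $dY(t) = f(t)\,dt + n^{-1/2}\,dW(t)$ and its
equivalent normal sequence model $Y_i = \theta_i + n^{-1/2}\varepsilon_i$, where $Y_i = \int \phi_i(x)\,dY(x)$,
$\theta_i = \int \phi_i(x)f(x)\,dx$ and $\varepsilon_i = \int \phi_i(x)\,dW(x)$, $i = 1, 2, \ldots$. Let the prior $\Pi_\alpha$ be
defined by $\theta_i \overset{\mathrm{ind}}{\sim} N(0, i^{-2\alpha-1})$. Let $\hat f = E(f \mid D_n)$, where $D_n$ stands for the
data. Note that $\hat f(x) = \sum_{i=1}^\infty \hat\theta_i\phi_i(x)$, where $\hat\theta_i = E(\theta_i \mid Y_i) = nY_i/(i^{2\alpha+1} + n)$,
and $\operatorname{var}(\theta_i \mid Y_i) = (i^{2\alpha+1} + n)^{-1}$. Let $B(\alpha, R) = \{f : \sum_{i=1}^\infty i^{\alpha}|\theta_i| \le R\}$. Below
we shall write “$\lesssim$” for inequality up to a constant and “$\asymp$” for equality in
order.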
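
The coordinatewise posterior used here is the usual normal-normal conjugate update with prior variance $i^{-2\alpha-1}$ and noise variance $1/n$, giving posterior mean $nY_i/(i^{2\alpha+1}+n)$ and posterior variance $(i^{2\alpha+1}+n)^{-1}$. A minimal Python sketch of this computation follows; the function names are hypothetical and the generic update is included only as a cross-check.

```python
def posterior_moments(y, n, alpha):
    """Posterior means and variances of theta_i given Y_i in the model
    Y_i = theta_i + eps_i / sqrt(n), with prior theta_i ~ N(0, i^(-2*alpha-1))."""
    means, variances = [], []
    for idx, y_i in enumerate(y, start=1):
        precision = idx ** (2 * alpha + 1) + n  # prior precision + data precision
        means.append(n * y_i / precision)
        variances.append(1.0 / precision)
    return means, variances


def generic_normal_update(y_i, prior_var, noise_var):
    """Textbook normal-normal update for a zero-mean prior (cross-check only)."""
    weight = prior_var / (prior_var + noise_var)
    return weight * y_i, weight * noise_var


if __name__ == "__main__":
    obs = [0.5, -1.0, 2.0]
    print(posterior_moments(obs, n=50, alpha=1.0))
```

Because the posterior variances do not involve the observations, the law of $f - \hat f$ given the data is the same for every data set.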

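Theorem 1. For any $M_n \to \infty$, $E_{f_0}\Pi_\alpha(f : \|f - f_0\|_\infty > M_n\epsilon_n \mid D_n) \to 0$
uniformly for all $f_0 \in B(\alpha, R)$, where $\epsilon_n = n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$, and for a
sufficiently large constant $M > 0$, $P_{f_0}\{\|f_0 - \hat f\|_\infty \le Mh_n\} \to 1$, where $h_n$ is
determined by $\Pi_\alpha(f : \|f - \hat f\|_\infty \le h_n \mid D_n) = 1 - \gamma$ and $\gamma < 1/2$ is a predetermined
constant. Moreover, $h_n \asymp \epsilon_n$.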


The theorem implies that the $(1 - \gamma)$-credible region for $L_\infty$-distance
around the posterior mean for any $\gamma < 1/2$ inflated by a sufficiently large
factor $M$ has asymptotic coverage 1 and its size $h_n$ is not larger than the
posterior contraction rate, which is nearly optimal. It is interesting to note
that $h_n$ is actually deterministic since the posterior distribution of $f - \hat f$ is
free of the observations. Analytical computation of $h_n$ may be difficult, but
it can be easily determined by simulations.
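
A minimal Monte Carlo sketch of such a simulation is given below. It draws the centered posterior process $f - \hat f$ as a Gaussian series with coefficient variances $(i^{2\alpha+1}+n)^{-1}$ in the Fourier basis and returns the empirical $(1-\gamma)$-quantile of its sup-norm. The truncation level `n_terms`, the grid size `n_grid`, the number of draws and the function names are assumptions made for illustration; the sup-norm is only approximated on the grid.

```python
import math
import random


def fourier_basis(i, x):
    """phi_1 = 1, phi_{2j} = sqrt(2) cos(2 pi j x), phi_{2j+1} = sqrt(2) sin(2 pi j x)."""
    if i == 1:
        return 1.0
    j = i // 2
    if i % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * j * x)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * j * x)


def simulate_radius(n, alpha, gamma, n_terms=101, n_grid=64, n_draws=200, seed=0):
    """Empirical (1 - gamma)-quantile of the grid sup-norm of the centered posterior process."""
    rng = random.Random(seed)
    grid = [g / n_grid for g in range(n_grid)]
    # Posterior standard deviation of the i-th coefficient; it does not depend on the data.
    sd = [1.0 / math.sqrt(i ** (2 * alpha + 1) + n) for i in range(1, n_terms + 1)]
    # Basis functions scaled by the posterior standard deviations, one row per grid point.
    rows = [[sd[i - 1] * fourier_basis(i, x) for i in range(1, n_terms + 1)] for x in grid]
    sup_norms = []
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(n_terms)]
        sup_norms.append(max(abs(sum(b * z for b, z in zip(row, g))) for row in rows))
    sup_norms.sort()
    k = math.ceil((1.0 - gamma) * n_draws) - 1
    return sup_norms[min(n_draws - 1, max(0, k))]


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        rate = n ** (-1.0 / 3.0) * math.sqrt(math.log(n))  # epsilon_n for alpha = 1
        print(n, simulate_radius(n, alpha=1.0, gamma=0.1), rate)
```

No observations enter the simulation, which reflects the fact that the radius is deterministic given $n$, $\alpha$ and $\gamma$.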


Proof of Theorem 1. We have $\hat f(x) = \sum_{i=1}^\infty \hat\theta_i\phi_i(x)$. Thus, given $D_n$,
$Z = f - \hat f$ is a mean-zero Gaussian process with covariance kernel

$$\operatorname{cov}\Bigl(\sum_{i=1}^\infty \theta_i\phi_i(s), \sum_{i=1}^\infty \theta_i\phi_i(t) \Bigm| D_n\Bigr) = \sum_{i=1}^\infty (i^{2\alpha+1} + n)^{-1}\phi_i(s)\phi_i(t),$$

$$E(|Z(s) - Z(t)|^2 \mid D_n) = \sum_{i=1}^\infty \operatorname{var}(\theta_i \mid D_n)|\phi_i(s) - \phi_i(t)|^2$$

and

$$\sum_{i=1}^\infty (i^{2\alpha+1} + n)^{-1}|\phi_i(s) - \phi_i(t)|^2 \lesssim n^{-2(\alpha-1)/(2\alpha+1)}|s - t|^2$$

by standard estimates and the fact $|\phi_i(s) - \phi_i(t)| \le 2\sqrt{2}\pi i|s - t|$, a consequence
of the mean value theorem and the boundedness of trigonometric
functions. Using a uniform grid with mesh-width $\delta_n \asymp n^{-p}$ for $p > 0$ sufficiently
large and a chaining argument for Gaussian processes with values
of $Z$ at the chosen grid points, Lemma 2.2.2 and Corollary 2.2.8 of
van der Vaart and Wellner (1996) give the estimate
$E\|Z\|_\infty \le \sqrt{E\|Z\|_\infty^2} \lesssim n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$.

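Let $V(x) = \hat f(x) - E_{f_0}\hat f(x) = \sum_{i=1}^\infty \sqrt{n}\,\varepsilon_i\phi_i(x)/(i^{2\alpha+1} + n)$. Then $V$ is a
mean-zero Gaussian process with covariance kernel
$\sum_{i=1}^\infty n(i^{2\alpha+1} + n)^{-2}\phi_i(s)\phi_i(t)$ and

$$E|V(s) - V(t)|^2 = \sum_{i=1}^\infty n(i^{2\alpha+1} + n)^{-2}|\phi_i(s) - \phi_i(t)|^2.$$

Arguing as before, it follows that $E_{f_0}\|V\|_\infty \lesssim n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$.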

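Now using the uniform boundedness of the basis functions and
$\sum_{i=1}^\infty i^{\alpha}|\theta_{0i}| \le R$, uniformly for $f_0 \in B(\alpha, R)$, we have for any $k$,
$\sum_{i>k}|\theta_{0i}| \le Rk^{-\alpha}$. Therefore,

$$\|E_{f_0}\hat f - f_0\|_\infty = \Bigl\|\sum_{i=1}^\infty \Bigl(\frac{n}{i^{2\alpha+1} + n} - 1\Bigr)\theta_{0i}\phi_i\Bigr\|_\infty \le \sqrt{2}R\Bigl(\frac{k^{\alpha+1}}{n} + k^{-\alpha}\Bigr) \lesssim n^{-\alpha/(2\alpha+1)}$$

by choosing $k = k_\alpha \asymp n^{1/(2\alpha+1)}$.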
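
The bias bound can be checked numerically. Splitting the sum at $k$, the terms with $i \le k$ contribute at most $\sqrt{2}Rk^{\alpha+1}/n$ and the terms with $i > k$ at most $\sqrt{2}Rk^{-\alpha}$. The Python sketch below evaluates both sides for a hypothetical test sequence $\theta_{0i} \propto i^{-\alpha-2}$ chosen so that $\sum_i i^{\alpha}|\theta_{0i}| \le R$; the sequence, the truncation at 2000 terms and the function names are illustrative assumptions.

```python
import math


def bias_sum(theta0, n, alpha):
    """sqrt(2) * sum_i i^(2a+1) / (i^(2a+1) + n) * |theta0_i|,
    an upper bound on the sup-norm bias of the posterior mean."""
    total = 0.0
    for i, t in enumerate(theta0, start=1):
        p = float(i) ** (2 * alpha + 1)
        total += p / (p + n) * abs(t)
    return math.sqrt(2.0) * total


def split_bound(n, alpha, radius, k):
    """sqrt(2) * R * (k^(alpha+1) / n + k^(-alpha)) for an integer split point k >= 1."""
    return math.sqrt(2.0) * radius * (float(k) ** (alpha + 1) / n + float(k) ** (-alpha))


def example_sequence(alpha, radius, n_terms=2000):
    """Hypothetical sequence with sum_i i^alpha |theta_i| <= radius."""
    c = 6.0 / math.pi ** 2  # makes the weights c / i^2 sum to less than 1 after truncation
    return [radius * c * float(i) ** (-alpha - 2.0) for i in range(1, n_terms + 1)]


if __name__ == "__main__":
    alpha, radius = 1.5, 2.0
    theta0 = example_sequence(alpha, radius)
    for n in (10, 100, 10 ** 4, 10 ** 6):
        k = max(1, round(n ** (1.0 / (2 * alpha + 1))))
        print(n, k, bias_sum(theta0, n, alpha), split_bound(n, alpha, radius, k))
```

The inequality holds for every integer split point $k \ge 1$; the choice $k \asymp n^{1/(2\alpha+1)}$ balances the two terms.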

Combining the three pieces, it follows using Chebyshev’s inequality that
the posterior contraction rate under the $L_\infty$-distance is $\epsilon_n$.

Now we find a lower bound for the size of the credible region. By definition
$h_n$, the $(1 - \gamma)$-quantile of the distribution of $\|Z\|_\infty$ for the mean-zero Gaussian
process $Z$ with covariance kernel $\sum_{i=1}^\infty (i^{2\alpha+1} + n)^{-1}\phi_i(s)\phi_i(t)$, is at least
as large as the median of the distribution of $\|Z\|_\infty$. Now $\sigma^2 = \sup_t E|Z(t)|^2$ is
easily seen to be of the order $n^{-2\alpha/(2\alpha+1)}$. Since $E\|Z\|_\infty^2 \ge \sigma^2$, standard facts about
Gaussian processes imply that $E\|Z\|_\infty$ and the median of $\|Z\|_\infty$ are of the
same order [cf. Ledoux and Talagrand (1991), pages 52 and 54]. Hence, to
find a lower bound for $h_n$ it suffices to lower bound $E\|Z\|_\infty$. We shall show
that the order of the lower bound is $n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$.

To this end, we observe that $\|Z\|_\infty \ge \max\{Z(j/k_\alpha) : j = 1, \ldots, k_\alpha\}$, and

$$E|Z(j/k_\alpha) - Z(l/k_\alpha)|^2 = \sum_{i=1}^\infty (i^{2\alpha+1} + n)^{-1}|\phi_i(j/k_\alpha) - \phi_i(l/k_\alpha)|^2.$$

With a sufficiently small fixed $\epsilon > 0$, there exists a $\delta > 0$ such that
$|\sin s - \sin t| > \epsilon$ if $|s - t| > \delta$ and $|s + t - \pi| > \delta$, and a similar assertion holds for
the cosine function. Therefore, it is observed that for $j, l = 1, \ldots, k_\alpha$, $j \ne l$,
$\phi_i(j/k_\alpha)$ and $\phi_i(l/k_\alpha)$ differ by at least a fixed positive number for a positive
fraction of $i \in \{2, \ldots, k_\alpha\}$. From this we obtain that there exists $c > 0$ such
that

$$E|Z(j/k_\alpha) - Z(l/k_\alpha)|^2 \ge cn^{-2\alpha/(2\alpha+1)}.$$

Let $U_j = \sqrt{2/c}\,n^{\alpha/(2\alpha+1)}Z(j/k_\alpha)$, $j = 1, \ldots, k_\alpha$, so that
$E(U_j - U_l)^2 \ge 2 = E(V_j - V_l)^2$, where $V_1, \ldots, V_{k_\alpha} \overset{\mathrm{iid}}{\sim} N(0,1)$. Hence, by Slepian’s inequality
[cf. Corollary 3.14 of Ledoux and Talagrand (1991)] and equation (3.14) of
Ledoux and Talagrand (1991), we obtain

$$E\max_j U_j \ge E\max_j V_j \gtrsim \sqrt{\log k_\alpha} \asymp \sqrt{\log n},$$

which upon rescaling gives $E\|Z\|_\infty \ge E(\max_j Z(j/k_\alpha)) \gtrsim n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$.

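Now turning to coverage, the lack of coverage of the credible set inflated
by a sufficiently large constant $M$ is given by

$$P_{f_0}\{\|f_0 - \hat f\|_\infty > Mh_n\} \le P\{\|V\|_\infty > M'\epsilon_n - \|E_{f_0}\hat f - f_0\|_\infty\} \le P\{\|V\|_\infty - E\|V\|_\infty > C\epsilon_n\} \to 0$$

by virtue of Borell’s inequality [cf. second assertion of Proposition A.2.1 of
van der Vaart and Wellner (1996)], since
$\sup_t \operatorname{var}(V(t)) \le \sup_t \operatorname{var}(Z(t)) \lesssim n^{-2\alpha/(2\alpha+1)}$ and
$\epsilon_n \asymp n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$, uniformly for $f_0 \in B(\alpha, R)$, where $M'$ and $C$ are
positive constants.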

Finally, we estimate the size of the inflated credible region. For that
we need to find an upper bound for the $(1 - \gamma)$-quantile of the distribution
of $\|Z\|_\infty$ given $D_n$. By Borell’s inequality [cf. third assertion of
Proposition A.2.1 of van der Vaart and Wellner (1996)], it is clear that the
$(1 - \gamma)$-quantile is bounded by $\sqrt{8E\|Z\|_\infty^2\log(2/\gamma)}$, which is of the order
$n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$. □

We also wish to study the coverage problem for $L_\infty$-credible regions when
the regularity $\alpha$ is not known. Consider the empirical Bayes device of the
paper under discussion and assume that the true sequence has a polished
tail. The heuristic arguments given below seem to indicate that the credible
region constructed by plugging in the empirical Bayes estimate of $\alpha$ should
have adequate coverage.

Because we deal with various values of $\alpha$ simultaneously, let us include $\alpha$
in the notation $\Pi_\alpha$ for the prior, $Z_\alpha$ and $V_\alpha$ for the Gaussian processes introduced
in the proof, and $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$ for the sup-norm posterior
contraction rate. We observe that $\epsilon_{n,\alpha}$ is decreasing in $\alpha$. By Theorem 5.1
of the paper under discussion, it follows that the empirical Bayes estimate
$\hat\alpha$ of $\alpha$ lies, with high probability, between two deterministic bounds $\underline{\alpha}$ and
$\overline{\alpha}$, and that

In the proof of the result on coverage of the credible region, one needs
to lower bound the radius of the credible ball around the estimate and
show that its order is at least as large as the convergence rate of the point
estimator given by the center of the credible region. When $\hat\alpha$ is plugged in,
the radius of the credible region is of the order of the expected value of
the supremum of the Gaussian process $Z_{\hat\alpha}$. The randomness of this process
comes from posterior variation conditioned on the sample, and hence $\hat\alpha$ can
be considered as a constant. Therefore, as argued in the proof of the theorem,
the radius of the credible region is of the order $\epsilon_{n,\hat\alpha} \gtrsim \epsilon_{n,\overline{\alpha}}$.

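The sampling error of the Bayes estimator $\hat f_\alpha$ using $\Pi_\alpha$ has two parts:
variability around its expectation and its bias. Now for any $s, t \in [0,1]$,
$E|Z_\alpha(t) - Z_\alpha(s)|^2$ is decreasing in $\alpha$, so, by Slepian’s inequality,

$$\sup\{E\|Z_\alpha\|_\infty : \underline{\alpha} \le \alpha \le \overline{\alpha}\} = E\|Z_{\underline{\alpha}}\|_\infty \lesssim \epsilon_{n,\underline{\alpha}},$$

and fixed quantiles of $\|Z_\alpha\|_\infty$ also have the same order as the expectation
of $\|Z_\alpha\|_\infty$ by Borell’s inequality. On the other hand, the bias of $\hat f_\alpha$ increases
with $\alpha$, and hence its maximum is attained at $\overline{\alpha}$ for $\underline{\alpha} \le \alpha \le \overline{\alpha}$. Note that if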
a underestimates the true a, then the order of the bias is x and 
so for every $\alpha$ lying in the range $[\underline{\alpha}, \overline{\alpha}]$, the posterior contraction rate would
be the same. Lemma 3.11 seems to indicate that this may be the case. This
will ensure adequate coverage of the empirical Bayes credible set.

Another issue that might be of interest for future investigation is the handling
of unknown variance. In the nonadaptive setting, both empirical and
hierarchical Bayes approaches can fruitfully address the issue of unknown
variance as demonstrated by Yoo and Ghosal (2014) for nonparametric regression.
In the adaptive setting, this is somewhat unclear, as the empirical
Bayes estimates of smoothness and variance will depend on each other.

It is also natural to ask if the hierarchical Bayes credible sets can also have 
adequate coverage in the adaptive setting. This may not have an affirmative 
answer, as indicated by Rivoirard and Rousseau (2012). 

Finally, for other curve estimation problems like density estimation or
nonparametric regression, what should be a proper analog of conditions
like self-similarity or polished tail, and how may that help in establishing
coverage? The nonparametric regression problem may be more tractable
than density estimation, since for the former a basis expansion approach
reduces the function of interest to a sequence of real-valued parameters which
are typically given normal priors as well and conjugacy holds in the model.
Usually it is more convenient to use a truncated series expansion, but then
the sequence of parameters forms a triangular array. It seems that the main
challenge will be to identify a proper analog of a condition on the tail of the
sequence in such a setting.